Multi-Modal Augmentation for Large Language Models with Applications to Task-Oriented Dialogues

Samarinas, Chris; Promthaw, Pracha; Lekhwani, Rahul; Mysore, Sheshera; Huang, Sung Ming; Nijasure, Atharva; Zeng, Hansi; Zamani, Hamed (October 2023, 2nd Proceedings of Alexa Prize TaskBot (Alexa Prize 2023))

We introduce MarunaBot V2, an advanced Task-Oriented Dialogue System (TODS) primarily aimed at aiding users in cooking and Do-It-Yourself tasks. We use large language models (LLMs) for data generation and inference, and implement hybrid methods for intent classification, retrieval, and question answering that balance efficiency and performance. A key feature of our system is its multi-modal capability. We have incorporated a multi-modal enrichment technique that uses a fine-tuned CLIP model to supplement recipe instructions with pertinent images, a custom Diffusion model for image enhancement and generation, and a method for multi-modal option matching. A unique aspect of our system is its user-centric development approach, facilitated by a custom tool for tracking user interactions and swiftly integrating feedback. For a demonstration of our system, visit https://youtu.be/4MNI-puv_eE.

